Recovery of Ego-Motion Using Image Stabilization

Michal Irani†  Benny Rousso  Shmuel Peleg
91904 Jerusalem, ISRAEL



Abstract 

A method for computing the 3D camera motion 
(the ego-motion) in a static scene is introduced, which 
is based on computing the 2D image motion of a sin- 
gle image region directly from image intensities. The 
computed image motion of this image region is used to 
register the images so that the detected image region 
appears stationary. The resulting displacement field 
for the entire scene between the registered frames is af- 
fected only by the 3D translation of the camera. After 
canceling the effects of the camera rotation by using 
such 2D image registration, the 3D camera translation 
is computed by finding the focus-of-expansion in the 
translation-only set of registered frames. This step is 
followed by computing the camera rotation to com- 
plete the computation of the ego-motion. 

The presented method avoids the inherent problems in the
computation of optical flow and of feature matching, and does
not assume any prior feature detection or feature
correspondence.

1 Introduction 

The motion observed in an image sequence can be 
caused by camera motion (ego-motion) and by mo- 
tions of objects moving in the scene. In this paper we 
address the case of a camera moving in a static scene. 
Complete 3D motion estimation is difficult since the
image motion at every pixel depends, in addition to
the six parameters of the camera motion, on the depth
at the corresponding scene point. To overcome this
difficulty, additional constraints are usually added to
the motion model or to the environment model.

3D motion is often estimated from the optical or 
normal flow derived between two frames [1, 12, 22], 
or from the correspondence of distinguished features 

*This research has been sponsored by the U.S. Office of Naval
Research under Grant N00014-93-1-1202, R&T Project Code
4424341-01.

†M. Irani is now with David Sarnoff Research Center.



(points, lines, contours) extracted from successive
frames [10, 13, 7]. Both approaches depend on the
accuracy of the feature detection, which cannot always
be assured. Methods for computing the ego-motion directly
from image intensities were also suggested [11, 14].

Camera rotations and translations can induce sim- 
ilar image motions [2, 8] causing ambiguities in their 
interpretation. At depth discontinuities, however, it 
is much easier to distinguish between the effects of 
camera rotations and camera translations, as the image
motion of neighboring pixels at different depths
will have similar rotational components, but different
translational components. Motion parallax methods
use this effect to obtain the 3D camera motion
[13, 17, 7]. Other methods use motion parallax for
shape representation and analysis [23, 6, 9].

In this paper a method for computing the ego- 
motion directly from image intensities is introduced. 
At first only 2D image motion is extracted, and later 
this 2D motion is used to simplify the computation of 
the 3D ego-motion. 

We use previously developed methods [15, 16] to 
detect and track a single image region and to com- 
pute its 2D parametric image motion. It is important 
to emphasize that the 3D camera motion cannot be 
recovered solely from the 2D parametric image mo- 
tion of a single image region, as there are a couple of 
such 3D interpretations [20]. It was shown that 3D 
motion of a planar surface can be computed from its 
2D affine motion in the image and from the motion 
derivatives [21], but motion derivatives introduce sensitivity
to noise. Moreover, the problem of recovering
the 3D camera motion directly from the image motion 
field is an ill-conditioned problem, since small errors in 
the 2D flow field usually result in large perturbations 
in the 3D motion [2]. 

To overcome the difficulties and ambiguities in the
computation of the ego-motion, we introduce the following
scheme: The first frame is warped towards the
second frame using the computed 2D image motion at
the detected image region. This registration cancels
the effects of the camera rotation for the entire scene,
and the resulting image displacements between the
two registered frames are due only to the 3D translation
of the camera. This translation is computed by locating
the FOE (focus-of-expansion) between the two
registered frames. Once the 3D translation is known, it
can be used, together with the 2D motion parameters
of the detected image region, to compute the 3D rotation
of the camera by solving a set of linear equations.

The 2D image region registration technique used in
this work allows easy decoupling of the translational
and rotational motions, as only motion parallax information
remains after the registration. As opposed to
other methods using motion parallax [18, 19, 17, 7],
our method does not rely on 2D motion information
computed near depth discontinuities, where it is inaccurate,
but on motion computed over an entire image.
The effect of motion parallax is obtained at all scene 
points that are not located on the extension of the 3D 
surface which corresponds to the registered image re- 
gion. This gives dense parallax data, as these scene 
points need not be adjacent to the registered 3D sur- 
face. 

The advantage of this technique is in its simplicity
and in its robustness. No prior detection and matching
are assumed, it requires solving only small sets of linear
equations, and each computational step is stated
as an overdetermined problem which is numerically
stable.

2 Ego-Motion from 2D Image Motion 

In this section we describe the technique for computing
the 3D ego-motion given the 2D parametric
motion of a single image region. The method for automatically
computing the 2D motion of a single image
region is briefly described in Sec. 4.

2.1 Basic Model and Notations 

Let $(X, Y, Z)$ denote the Cartesian coordinates of
a scene point with respect to the camera (see Fig. 1),
and let $(x, y)$ denote the corresponding coordinates in
the image plane. The image plane is located at the
focal length: $Z = f_c$. The perspective projection of
a scene point $P = (X, Y, Z)^t$ on the image plane at a
point $p = (x, y)^t$ is expressed by:

$$p = \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} f_c X / Z \\ f_c Y / Z \end{bmatrix} \tag{1}$$

The camera motion has two components: a translation
$T = (T_X, T_Y, T_Z)^t$ and a rotation $\Omega = (\Omega_X, \Omega_Y, \Omega_Z)^t$.



Figure 1: The coordinate system.
The coordinate system $(X, Y, Z)$ is attached to the camera,
and the corresponding image coordinates $(x, y)$ on
the image plane are located at $Z = f_c$. A point
$P = (X, Y, Z)^t$ in the world is projected onto an image point
$p = (x, y)^t$. $T = (T_X, T_Y, T_Z)^t$ and $\Omega = (\Omega_X, \Omega_Y, \Omega_Z)^t$
represent the relative translation and rotation of the
camera in the scene.



Due to the camera motion the scene point
$P = (X, Y, Z)^t$ appears to be moving relative to the camera
with rotation $-\Omega$ and translation $-T$, and is therefore
observed at new world coordinates $P' = (X', Y', Z')^t$,
expressed by:

$$\begin{bmatrix} X' \\ Y' \\ Z' \end{bmatrix} = M_{-\Omega} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} - T \tag{2}$$

where $M_{-\Omega}$ is the matrix corresponding to a rotation
by $-\Omega$.

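When the field of view is not very large and the
camera motion has a relatively small rotation [1], the
2D displacement $(u, v)$ of an image point $(x, y)$ in the
image plane can be expressed by [20, 3]:

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} -f_c \left( \frac{T_X}{Z} + \Omega_Y \right) + x \frac{T_Z}{Z} + y \Omega_Z - x^2 \frac{\Omega_Y}{f_c} + x y \frac{\Omega_X}{f_c} \\ -f_c \left( \frac{T_Y}{Z} - \Omega_X \right) - x \Omega_Z + y \frac{T_Z}{Z} - x y \frac{\Omega_Y}{f_c} + y^2 \frac{\Omega_X}{f_c} \end{bmatrix} \tag{3}$$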



The following is noted from Eq. (3): 

• Since all translations are divided by the unknown 
depth Z, only the direction of the translation can 
be recovered, but not its magnitude. 

• The contribution of the camera rotation to the
displacement of an image point is independent of
the depth Z of the corresponding scene point.
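
The two observations above can be checked numerically. The following Python sketch evaluates the small-rotation displacement model, assuming the standard form of Eq. (3) for the sign conventions of Eq. (2). The function name `displacement` and all numeric values are hypothetical and serve only as an illustration.

```python
import numpy as np

def displacement(x, y, Z, T, Omega, fc):
    """Image displacement (u, v) at image point (x, y) with depth Z.

    Assumed form of Eq. (3): small rotation, small field of view.
    T = (TX, TY, TZ), Omega = (OX, OY, OZ) in radians, fc = focal length.
    """
    TX, TY, TZ = T
    OX, OY, OZ = Omega
    u = -fc * (TX / Z + OY) + x * TZ / Z + y * OZ - x**2 * OY / fc + x * y * OX / fc
    v = -fc * (TY / Z - OX) - x * OZ + y * TZ / Z - x * y * OY / fc + y**2 * OX / fc
    return np.array([u, v])

# Hypothetical values, chosen only for illustration.
fc = 500.0
T = np.array([1.7, 0.4, 12.0])
Omega = np.array([0.0, -0.03, -0.05])

# Rotation only: the displacement does not change with depth.
rot_near = displacement(10.0, -5.0, 100.0, np.zeros(3), Omega, fc)
rot_far = displacement(10.0, -5.0, 1000.0, np.zeros(3), Omega, fc)

# Scaling T and Z together leaves the displacement unchanged,
# so only the direction of T can be recovered.
d1 = displacement(10.0, -5.0, 100.0, T, Omega, fc)
d2 = displacement(10.0, -5.0, 200.0, 2.0 * T, Omega, fc)
```

Here `rot_near` equals `rot_far`, and `d1` equals `d2`, which mirrors the depth independence of the rotational term and the scale ambiguity of the translation.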

All points $(X, Y, Z)$ of a planar surface in the 3D scene
satisfy a plane equation $Z = A + B \cdot X + C \cdot Y$, which can
be expressed in terms of image coordinates by using
Eq. (1) as:

$$\frac{1}{Z} = \alpha + \beta \cdot x + \gamma \cdot y. \tag{4}$$

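In a similar manipulation to that in [1], substituting
Eq. (4) in Eq. (3) yields:

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} a + b \cdot x + c \cdot y + g \cdot x^2 + h \cdot x y \\ d + e \cdot x + f \cdot y + g \cdot x y + h \cdot y^2 \end{bmatrix} \tag{5}$$

where:

$$\begin{aligned}
a &= -f_c \alpha T_X - f_c \Omega_Y \\
b &= \alpha T_Z - f_c \beta T_X \\
c &= \Omega_Z - f_c \gamma T_X \\
d &= -f_c \alpha T_Y + f_c \Omega_X \\
e &= -\Omega_Z - f_c \beta T_Y \\
f &= \alpha T_Z - f_c \gamma T_Y \\
g &= -\frac{\Omega_Y}{f_c} + \beta T_Z \\
h &= \frac{\Omega_X}{f_c} + \gamma T_Z
\end{aligned} \tag{6}$$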



Eq. (5) describes the 2D parametric motion in
the image plane, expressed by eight parameters
(a, b, c, d, e, f, g, h), which corresponds to a general 3D
motion of a planar surface in the scene, assuming a
small field of view and a small rotation. We call
Eq. (5) a pseudo 2D projective transformation, since
under these assumptions, it is a good approximation
to the 2D projective transformation.
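
A short Python sketch may help make the eight-parameter model concrete. It maps a camera motion and a plane to the parameters of Eq. (6) and evaluates the pseudo 2D projective transformation of Eq. (5). The function names `planar_params` and `pseudo_projective` are hypothetical, and the expressions used for e, g and h are the ones obtained by substituting Eq. (4) into the small-rotation displacement model, so they should be read as an assumption wherever the printed equation is incomplete.

```python
import numpy as np

def planar_params(T, Omega, plane, fc):
    """Eight parameters (a, b, c, d, e, f, g, h) of Eq. (6).

    plane = (alpha, beta, gamma) with 1/Z = alpha + beta*x + gamma*y.
    """
    TX, TY, TZ = T
    OX, OY, OZ = Omega
    al, be, ga = plane
    return np.array([
        -fc * al * TX - fc * OY,   # a
        al * TZ - fc * be * TX,    # b
        OZ - fc * ga * TX,         # c
        -fc * al * TY + fc * OX,   # d
        -OZ - fc * be * TY,        # e (derived by substitution)
        al * TZ - fc * ga * TY,    # f
        -OY / fc + be * TZ,        # g (derived by substitution)
        OX / fc + ga * TZ,         # h (derived by substitution)
    ])

def pseudo_projective(params, x, y):
    """Displacement (u, v) of Eq. (5) at image point (x, y)."""
    a, b, c, d, e, f, g, h = params
    u = a + b * x + c * y + g * x * x + h * x * y
    v = d + e * x + f * y + g * x * y + h * y * y
    return np.array([u, v])
```

For any point on the plane, `pseudo_projective(planar_params(...), x, y)` should agree with the depth-dependent displacement model evaluated at `Z = 1 / (alpha + beta*x + gamma*y)`.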

2.2 General Framework of the Algorithm 

In this section we present a scheme which utilizes
the robustness of the 2D motion computation for computing
3D motion between two consecutive frames:

1. A single image region is automatically detected,
and its 2D parametric image motion is computed
(Sec. 4).

2. The two frames are registered according to the 
computed 2D parametric motion of the detected 
image region. This image region stabilization can- 
cels the rotational component of the camera mo- 
tion for the entire scene (Sec. 2.3), and the camera 
translation can now be computed from the focus- 
of-expansion between the two registered frames 
(Sec. 2.4). 

3. The 3D rotation of the camera is now computed 
(Sec. 2.5) given the 2D motion parameters of the 
detected image region and the 3D translation of 
the camera. 

2.3 Eliminating Camera Rotation 

At this stage we assume that a single image region
with a parametric 2D image motion has been detected,
and that the 2D image motion of that region has been
computed. The automatic detection and computation
of the 2D image motion for planar 3D surfaces is described
in Sec. 4.



Let $(u(x, y), v(x, y))$ denote the 2D image motion
of the entire scene from frame $f_1$ to frame $f_2$, and
let $(u_s(x, y), v_s(x, y))$ denote the 2D image motion of
a single image region (the detected image region) between
the two frames. It was mentioned in Sec. 2.1
that $(u_s, v_s)$ can be expressed by a 2D parametric
transformation in the image plane if the image region
is an image of a planar surface in the 3D scene
(Eq. (5)). Let $s$ denote the 3D surface of the detected
image region, with depths $Z_s(x, y)$. Note that only
the 2D motion parameters $(u_s(x, y), v_s(x, y))$ of the
planar surface are known. The 3D position or motion
parameters of the planar surface are still unknown.
Let $f_1^R$ denote the frame obtained by warping the entire
frame $f_1$ towards frame $f_2$ according to the 2D
parametric transformation $(u_s, v_s)$ extended to the entire
frame. This warping will cause the image region
of the detected planar surface, as well as scene parts
which are coplanar with it, to be stationary between
$f_1^R$ and $f_2$. In the warping process, each pixel $(x, y)$
in $f_1$ is displaced by $(u_s(x, y), v_s(x, y))$ to form $f_1^R$.
3D points that are not located on the surface $s$ (i.e.,
$Z(x, y) \neq Z_s(x, y)$) will not be in registration between
$f_1^R$ and $f_2$.

We will now show that the 2D image motion between
the registered frames ($f_1^R$ and $f_2$) is affected
only by the camera translation T.

Let $P_1 = (X_1, Y_1, Z_1)^t$ denote the 3D scene point
projected onto $p_1 = (x_1, y_1)^t$ in $f_1$. According to
Eq. (1): $P_1 = (x_1 Z_1 / f_c, \; y_1 Z_1 / f_c, \; Z_1)^t$. Due to the camera
motion $(\Omega, T)$ from frame $f_1$ to frame $f_2$, the point $P_1$
will be observed in frame $f_2$ at $p_2 = (x_2, y_2)^t$, which
corresponds to the 3D scene point $P_2 = (X_2, Y_2, Z_2)^t$.
According to Eq. (2):

$$P_2 = \begin{bmatrix} X_2 \\ Y_2 \\ Z_2 \end{bmatrix} = M_{-\Omega} \cdot P_1 - T \tag{7}$$



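The warping of $f_1$ by $(u_s, v_s)$ to form $f_1^R$ is equivalent
to applying the camera motion $(\Omega, T)$ to the 3D points
as though they are all located on the surface $s$ (i.e.,
with depths $Z_s(x, y)$). Let $P_s$ denote the 3D point
on the surface $s$ which corresponds to the pixel $(x, y)$
with depth $Z_s(x, y)$. Then:

$$P_s = \begin{bmatrix} x_1 Z_s / f_c \\ y_1 Z_s / f_c \\ Z_s \end{bmatrix} = \frac{Z_s}{Z_1} \begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix} = \frac{Z_s}{Z_1} \cdot P_1 \tag{8}$$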



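After the image warping, $P_s$ is observed in $f_1^R$ at
$p^R = (x^R, y^R)^t$, which corresponds to a 3D scene point $P^R$.
Therefore, according to Eq. (2) and Eq. (8):

$$P^R = M_{-\Omega} \cdot P_s - T = \frac{Z_s}{Z_1} \cdot M_{-\Omega} \cdot P_1 - T$$

and therefore:

$$P_1 = \frac{Z_1}{Z_s} \cdot M_{-\Omega}^{-1} \cdot (P^R + T).$$

Using Eq. (7) we get:

$$P_2 = M_{-\Omega} \cdot P_1 - T = \frac{Z_1}{Z_s} \cdot (P^R + T) - T = \frac{Z_1}{Z_s} \cdot P^R + \left( 1 - \frac{Z_1}{Z_s} \right) \cdot (-T)$$

and therefore:

$$P^R = \frac{Z_s}{Z_1} \cdot P_2 + \left( 1 - \frac{Z_s}{Z_1} \right) \cdot (-T). \tag{9}$$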

Eq. (9) shows that the 3D motion between $P^R$ and $P_2$
is not affected by the camera rotation $\Omega$, but only by
its translation T. Moreover, it shows that $P^R$ is on
the straight line going through $P_2$ and $-T$. Therefore,
the projection of $P^R$ on the image plane ($p^R$) is on the
straight line going through the projection of $P_2$ on the
image plane (i.e., $p_2$) and the projection of $-T$ on the
image plane (which is the FOE). This means that $p^R$
is found on the radial line emerging from the FOE
towards $p_2$. In other words, the motion between the
registered frames $f_1^R$ and $f_2$ (i.e., $p^R - p_2$) is directed
towards, or away from, the FOE, and is therefore induced
by the camera translation T.
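
The collinearity argument can be verified numerically. The sketch below, in Python, applies an exact rotation (Rodrigues' formula) rather than the small-rotation approximation, and treats the warp as an exact application of the camera motion to the surface point, which is an idealization of the 2D parametric warp. The function names `rotation_matrix` and `registered_point` are hypothetical.

```python
import numpy as np

def rotation_matrix(omega):
    """Rotation matrix for the rotation vector omega (Rodrigues' formula)."""
    omega = np.asarray(omega, dtype=float)
    theta = np.linalg.norm(omega)
    if theta == 0.0:
        return np.eye(3)
    k = omega / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def registered_point(P1, Zs, Omega, T):
    """Return (P2, PR) for a scene point P1 and surface depth Zs.

    P2 is the point after the true camera motion.
    PR is the point obtained when the same motion is applied to the
    point on the surface s that shares the pixel of P1.
    """
    P1 = np.asarray(P1, dtype=float)
    T = np.asarray(T, dtype=float)
    M = rotation_matrix(-np.asarray(Omega, dtype=float))
    P2 = M @ P1 - T              # true motion of the scene point
    Ps = (Zs / P1[2]) * P1       # surface point on the same viewing ray
    PR = M @ Ps - T              # motion applied as if P1 were on s
    return P2, PR
```

With `ratio = Zs / P1[2]`, the returned points satisfy `PR = ratio * P2 + (1 - ratio) * (-T)`, so `PR`, `P2` and `-T` lie on one straight line regardless of the rotation.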

In Fig. 2, the optical flow is displayed before and 
after registration of two frames according to the com- 
puted 2D motion parameters of the image region 
which corresponds to the wall at the back of the scene. 
The optical flow is given for display purposes only, and 
was not used in the registration. After registration,
the rotational component of the optical flow was canceled
for the entire scene, and all flow vectors point
towards the real FOE (Fig. 2.c). Before registration
(Fig. 2.b) the FOE mistakenly appears to be located
elsewhere (in the middle of the frame). This is due
to the ambiguity caused by the rotation around the 
Y-axis, which visually appears as a translation along 
the X-axis. This ambiguity is resolved by the 2D reg- 
istration. 

2.4 Computing Camera Translation 

Once the rotation is canceled by the registration
of the detected image region, the ambiguity between 




Figure 2: The effect of region registration. The real
FOE is marked by +.

a) One of the frames.

b) The optical flow between two adjacent frames (before
registration), overlaid on Fig. 2.a.

c) The optical flow after 2D registration of the wall.
The flow is induced by pure camera translation (after
the camera rotation was canceled), and points now to
the correct FOE.

d) The computed depth map. Bright regions correspond
to close objects.



image motion caused by 3D rotation and that caused
by 3D translation no longer exists. Having only camera
translation, the flow field is directed to, or away
from, the FOE. The computation of the 3D translation
therefore becomes overdetermined and numerically
stable, as the only two unknowns indicate the
location of the FOE in the image plane.

To locate the FOE, the optical flow between the
registered frames is computed, and the FOE is located
using a search method similar to that described in [18].
Candidates for the FOE are sampled over a half sphere
and projected onto the image plane. For each such
candidate, a global error measure is computed from 
local deviations of the flow field from the radial lines 
emerging from the candidate FOE. The search process 
is repeated by refining the sampling (on the sphere) 
around good FOE candidates. After a few refinement 
iterations, the FOE is taken to be the candidate with 
the smallest error. 
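
A coarse-to-fine search of this kind might be sketched in Python as follows. This is only an illustration under several assumptions: candidates are sampled on a grid directly in the image plane rather than on a half sphere (so an FOE at infinity is not handled), and the error measure is a simple stand-in, namely the squared distance between the candidate and the line carried by each flow vector, which is zero exactly when every flow vector lies on a radial line through the candidate. The function names `foe_error` and `locate_foe` and all parameter defaults are hypothetical.

```python
import numpy as np

def foe_error(foe, pts, flow):
    """Global error of a candidate FOE.

    Sums the squared distances from the candidate to the lines that pass
    through each point along its flow direction. Assumes non-zero flow.
    """
    d = flow / np.linalg.norm(flow, axis=1, keepdims=True)  # unit flow directions
    r = pts - foe                                           # candidate-to-point vectors
    dist = r[:, 0] * d[:, 1] - r[:, 1] * d[:, 0]            # 2D cross product
    return float(np.sum(dist ** 2))

def locate_foe(pts, flow, center=(0.0, 0.0), half_width=500.0, grid=11, iters=12):
    """Coarse-to-fine grid search for the FOE in image coordinates."""
    best = np.asarray(center, dtype=float)
    for _ in range(iters):
        xs = np.linspace(best[0] - half_width, best[0] + half_width, grid)
        ys = np.linspace(best[1] - half_width, best[1] + half_width, grid)
        cands = np.array([(cx, cy) for cx in xs for cy in ys])
        errs = [foe_error(c, pts, flow) for c in cands]
        best = cands[int(np.argmin(errs))]
        half_width /= 2.0   # refine the sampling around the best candidate
    return best
```

Because every flow vector contributes to the error, noise in individual vectors tends to average out, which is in line with the overdetermined nature of the problem described in the text.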

Since the problem of locating the FOE in a purely
translational flow field is a highly overdetermined
problem, the computed flow field need not be accurate.
This is opposed to most methods which try to compute
the ego-motion from the flow field, and require
an accurate flow field in order to resolve the
rotation-translation ambiguity.

2.5 Computing Camera Rotation 

Let (a, b, c, d, e, f, g, h) be the 2D motion parameters
of the 3D planar surface corresponding to the
detected image region, as expressed by Eq. (5). Given
these 2D motion parameters and the 3D translation
parameters of the camera $(T_X, T_Y, T_Z)$, the 3D rotation
parameters of the camera $(\Omega_X, \Omega_Y, \Omega_Z)$ (as well
as the planar surface parameters $(\alpha, \beta, \gamma)$) can be obtained
by solving Eq. (6), which is a set of eight linear
equations in six unknowns.

From our experience, the parameters g and h in
the pseudo 2D projective transformation, computed
by the method described in Sec. 4, are not as reliable
as the other six parameters (a, b, c, d, e, f), as g and h
are second order terms in Eq. (5). Therefore, whenever
possible (when the set of Eq. (6) is numerically
overdetermined), we avoid using the last two equations
(for g and h), and use only the first six. This yields
more accurate results.
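
The linear system can be sketched in Python with a least-squares solve. The unknown vector is ordered as $(\Omega_X, \Omega_Y, \Omega_Z, \alpha, \beta, \gamma)$, and the rows for e, g and h use the expressions obtained by substituting Eq. (4) into the displacement model, which is an assumption wherever the printed Eq. (6) is incomplete. The function name `solve_rotation` and the flag `use_second_order` are hypothetical. Since T is known only up to scale, the recovered plane parameters carry the inverse scale, while the rotation is unaffected.

```python
import numpy as np

def solve_rotation(params, T, fc, use_second_order=False):
    """Solve Eq. (6) for (Omega, plane) given (a..h) and the translation T.

    By default only the first six equations (a..f) are used; set
    use_second_order=True to include the equations for g and h.
    """
    TX, TY, TZ = T
    # Columns: Omega_X, Omega_Y, Omega_Z, alpha, beta, gamma
    A = np.array([
        [0.0,      -fc,       0.0, -fc * TX,  0.0,       0.0],       # a
        [0.0,       0.0,      0.0,  TZ,      -fc * TX,   0.0],       # b
        [0.0,       0.0,      1.0,  0.0,      0.0,      -fc * TX],   # c
        [fc,        0.0,      0.0, -fc * TY,  0.0,       0.0],       # d
        [0.0,       0.0,     -1.0,  0.0,     -fc * TY,   0.0],       # e
        [0.0,       0.0,      0.0,  TZ,       0.0,      -fc * TY],   # f
        [0.0,      -1.0 / fc, 0.0,  0.0,      TZ,        0.0],       # g
        [1.0 / fc,  0.0,      0.0,  0.0,      0.0,       TZ],        # h
    ])
    rhs = np.asarray(params, dtype=float)
    if not use_second_order:
        A, rhs = A[:6], rhs[:6]
    sol, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return sol[:3], sol[3:]
```

`np.linalg.lstsq` returns a minimum-norm solution when the system is rank deficient, so the result should be trusted only for motions in which the chosen equations determine the unknowns.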

As a matter of fact, the only case in which all eight
equations of (6) must be used to recover the camera
rotation is the case when the camera translation is
parallel to the image plane (i.e., $T \neq 0$ and $T_Z = 0$).
This is the only configuration of camera motion in
which the first six equations of (6) do not suffice for
retrieving the rotation parameters. If only
the first six equations of (6) are used (i.e., using only
the reliable parameters a, b, c, d, e, f, and disregarding
the unreliable ones, g and h), then only $\Omega_Z$ can be recovered
in this case. In order to recover the two other
rotation parameters, $\Omega_X$ and $\Omega_Y$, the second order
terms g and h must be used. This means that for the
case of an existing translation with $T_Z = 0$, only the
translation parameters $(T_X, T_Y, T_Z)$ and one rotation
parameter, $\Omega_Z$ (the rotation around the optical axis),
can be recovered accurately. The other two rotation
parameters, $\Omega_X$ and $\Omega_Y$, can only be approximated.

In all other configurations of camera motion the 
camera rotation can be reliably recovered. 

2.6 Experimental Results 

The camera motion (in cm) between the two
frames in Fig. 2 was: $(T_X, T_Y, T_Z) = (1.7, 0.4, 12)$ and
$(\Omega_X, \Omega_Y, \Omega_Z) = (0^\circ, -1.8^\circ, -3^\circ)$. The computation of
the 3D motion parameters of the camera (after setting
$T_Z = 12$) yielded: $(T_X, T_Y, T_Z) = (1.68, 0.16, 12)$ and
$(\Omega_X, \Omega_Y, \Omega_Z) = (-0.05^\circ, -1.7^\circ, -3.25^\circ)$.

Once the 3D motion parameters of the camera
are computed, the 3D scene structure can be reconstructed
using a scheme similar to that suggested in
[11]. Correspondences between small image patches
(currently 5x5 pixels) are computed only along the
radial lines emerging from the FOE (taking the rotations
into account). The depth map is computed from
the magnitude of these displacements. In Fig. 2.d, the
computed inverse depth map of the scene ($1 / Z(x, y)$) is
displayed.

3 Camera Stabilization 

Once the ego-motion of the camera is determined,
this information can be used for post-imaging stabilization
of the sequence, as if the camera had been
mounted on a gyroscopic stabilizer.

For example, to obtain perfect stabilization, the images
can be warped back to the original position of the
first image to cancel the computed 3D rotations. Since
rotation is depth-independent, such image warping is
easy to perform, resulting in a new sequence which
contains only 3D translations, and looks as if taken
from a stabilized platform. An example of such stabilization
is shown in Fig. 3.d. Alternatively, the rotations
can be filtered by a low-pass filter so that the
resulting sequence will appear to have only smooth
rotations, but no jitter.
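
The low-pass alternative might look as follows in Python. This is a minimal sketch under stated assumptions: the recovered rotations are accumulated into per-frame angles relative to the first frame, the angles are small enough to be averaged component by component, and a moving average stands in for whatever low-pass filter is preferred. The function name `smooth_rotations` and the parameter `window` are hypothetical.

```python
import numpy as np

def smooth_rotations(omegas, window=5):
    """Moving-average low-pass filter over per-frame rotation angles.

    omegas: array of shape (N, 3), accumulated (Omega_X, Omega_Y, Omega_Z)
    per frame. Returns (smoothed, correction), where correction is the
    rotation to add to each frame so that it follows the smoothed path.
    """
    omegas = np.asarray(omegas, dtype=float)
    if window % 2 == 0:
        raise ValueError("window must be odd")
    pad = window // 2
    # Repeat the edge values so the output has the same length as the input.
    padded = np.pad(omegas, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    smoothed = np.stack(
        [np.convolve(padded[:, i], kernel, mode="valid") for i in range(3)],
        axis=1,
    )
    correction = smoothed - omegas
    return smoothed, correction
```

Setting the smoothed path to zero instead (that is, `correction = -omegas`) corresponds to the full stabilization case, in which all computed rotations are canceled.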

4 Computing 2D Motion of a Planar 
Surface 

We use previously developed methods [15, 16] in 
order to detect an image region corresponding to a 
planar surface in the scene with its pseudo 2D projec- 
tive transformation. These methods treated dynamic 
scenes, in which there were assumed to be multiple 
moving planar objects. The image plane was seg- 
mented into the differently moving objects, and their 
2D image motion parameters were computed. 

In this work we use the 2D detection algorithm in 
order to detect a single planar surface and its 2D image 
motion parameters. Due to camera translation, planes 
at different depths or orientations will have different 
2D motions in the image plane, and will therefore be 
identified as differently moving planar objects. When 
the scene is not piecewise planar, but contains planar
surfaces, the 2D detection algorithm still detects the 
image motion of its planar regions. 

In this section we describe very briefly how the technique
for detecting multiple moving planar objects
locks onto the one planar object and its 2D motion
parameters. More details appear in [15, 16].

The projected 2D image motion $(u(x, y), v(x, y))$ of
a planar moving object in the scene can be approximated
by the 2D parametric transformation of Eq. (5).
If the support R of this planar object were known in
Figure 3: Camera Stabilization.

a) One of the frames in the sequence.

b) The average of two frames, having both rotation and
translation. The white lines display the image motion.

c) The average of the two frames after registration of
the shirt. Only effects of camera translation remain.

d) The average of the two frames after recovering the
ego-motion, and canceling the camera rotation. This
results in a stabilized pair of images.

the image plane, then it would be simple to estimate
its 2D parametric image motion $(u, v)$ between two
successive frames, $I(x, y, t)$ and $I(x, y, t + 1)$. This
could be done by computing the eight parameters
(a, b, c, d, e, f, g, h) of the transformation $(u, v)$ (see
Eq. (5)) which minimize the following error function
over the region of support R [16]:

$$Err^{(t)}(u, v) = \sum_{(x, y) \in R} (u I_x + v I_y + I_t)^2 \tag{10}$$

where $I_x$, $I_y$ and $I_t$ denote the spatial and temporal
derivatives of the image intensity.

The error minimization is performed iteratively using 
a Gaussian pyramid [4, 15, 16]. 

Unfortunately, the region of support R of a planar
object is not known in advance. Applying the error
minimization technique to the entire image would
usually yield a meaningless result.

This, however, is not true for simple 2D translation,
where the 2D motion can be expressed by
$(u(x, y), v(x, y)) = (a, d)$. It was shown in [5] that
the motion parameters of a single translating image
region can be recovered accurately by minimizing the
error function $Err^{(t)}(a, d) = \sum_{(x, y)} (a I_x + d I_y + I_t)^2$
with respect to a and d over the entire image (again,
using iterations on a multiresolution data structure).
This can be done even in the presence of other moving
objects in the region of analysis, and with no prior
knowledge of their regions of support. This object
is called the dominant translating object, and its 2D
translation the dominant 2D translation.

In [15, 16] this method was extended to compute
higher order 2D motions (2D affine, 2D projective) of
a single planar object among differently moving objects.
A segmentation step, which marks the region
corresponding to the computed dominant 2D motion,
was added. This is the region of the dominant planar
object in the image.

The scheme for locking onto a single planar object
and its 2D image motion is gradual, where the complexity
of the 2D motion model is increased in each
computation step, and the segmentation of the planar
object is refined accordingly. More details can
be found in [16]. The 2D motion models used in the
gradual locking on a planar object are listed below in
increasing complexity:

1. Translation: 2 parameters, $u(x, y) = a$,
$v(x, y) = d$. This model is applied to the entire
image to get an initial motion estimation. This
computation is followed by segmentation to obtain
a rough estimate of the object's location.

2. Affine: 6 parameters, $u(x, y) = a + bx + cy$,
$v(x, y) = d + ex + fy$. This model is applied only
to the segmented region obtained in the translation
computation step, to get an affine approximation
of the object's motion. The previous segmentation
is refined accordingly.

3. A moving planar surface (a pseudo 2D projective
transformation): 8 parameters [1, 3] (see
Eq. (5)), $u(x, y) = a + bx + cy + gx^2 + hxy$,
$v(x, y) = d + ex + fy + gxy + hy^2$. This model is
applied to the previously segmented region to further
refine the 2D motion estimation of the planar
object, and its segmentation.
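
The three models listed above differ only in the columns of the linear system that results from minimizing the error function over a region. The Python sketch below shows one such least-squares step, assuming the image derivatives $I_x$, $I_y$, $I_t$ are already available; the iterations, the Gaussian pyramid and the segmentation step described in the text are omitted. The function name `fit_motion`, its arguments and the model labels are hypothetical.

```python
import numpy as np

def fit_motion(Ix, Iy, It, x, y, model="projective", mask=None):
    """One least-squares step minimizing sum (u*Ix + v*Iy + It)^2 over a region.

    Ix, Iy, It: image derivatives; x, y: pixel coordinates (same shape).
    mask: boolean region of support (defaults to the whole image).
    Returns (a, d), (a, b, c, d, e, f) or (a, b, c, d, e, f, g, h).
    """
    if mask is None:
        mask = np.ones(np.shape(Ix), dtype=bool)
    Ix, Iy, It, x, y = (np.asarray(v, dtype=float)[mask] for v in (Ix, Iy, It, x, y))
    if model == "translation":
        cols = [Ix, Iy]
    elif model == "affine":
        cols = [Ix, Ix * x, Ix * y, Iy, Iy * x, Iy * y]
    elif model == "projective":
        cols = [Ix, Ix * x, Ix * y, Iy, Iy * x, Iy * y,
                Ix * x * x + Iy * x * y,     # multiplies g
                Ix * x * y + Iy * y * y]     # multiplies h
    else:
        raise ValueError("unknown model: " + str(model))
    A = np.stack(cols, axis=1)
    params, *_ = np.linalg.lstsq(A, -It, rcond=None)
    return params
```

In a gradual scheme, the mask passed to the affine and projective fits would be the segmentation obtained from the previous, simpler model.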

5 Concluding Remarks 

A method for computing ego-motion in static
scenes was introduced. At first, an image region corresponding
to a planar surface in the scene is detected,
and its 2D motion parameters between successive
frames are computed. The 2D transformation
is then used for image warping, which cancels the rotational
component of the 3D camera motion for the entire
scene, and reduces the problem to pure 3D translation.
The 3D translation (the FOE) is computed
from the registered frames, and then the 3D rotation
is computed by solving a small set of linear equations.

It was shown that the ego-motion can be recovered
reliably in all cases, except for two: the case of an
entirely planar scene, and the case of an ego-motion
with a translation in the x-y plane only. The first case
cannot be uniquely resolved by humans either, due to
a visual ambiguity. In the second case it was shown
that only the translation parameters of the camera and
the rotation around its optical axis can be recovered
accurately. The panning parameters (rotation around
the x and y axes) can only be roughly estimated in
this special case. In all other configurations of camera
motion the ego-motion can be reliably recovered.

The advantage of the presented technique is in its
simplicity, and in the robustness and stability of each
computational step. The choice of an initial 2D motion
model enables efficient motion computation and
numerical stability. There are no severe restrictions
on the ego-motion or on the structure of the environment.
Most steps use only image intensities, and the
optical flow is used only for extracting the FOE in the
case of pure 3D translation, which does not require accurate
optical flow. The inherent problems of optical
flow and of feature matching are therefore avoided.